The goal of this EDA is to understand different parameters which impacts an individual’s choice of Airpot or Airline.
Following needs to be prior to any analysis:
Load two dataset provided with this project.
dat1 <- read_excel("../raw_data/airport_choice_survey_EN_ver2.0_Capstone.xlsx", sheet = 1)
dat2 <- read_excel("../raw_data/Airport_Airline_data.xlsx", sheet = 1)
airline_airfare <- read_excel("../raw_data/Airport_Airline_data.xlsx", sheet = 2)
provincial_distances <- read_excel("../raw_data/Airport_Airline_data.xlsx", sheet = 3)
Summarization of the given data
# Examine data structure
str(dat1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 488 obs. of 28 variables:
## $ ID : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Airport : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Airline : num 1 1 1 1 1 1 1 1 3 1 ...
## $ Age : num 49 49 25 29 34 43 43 52 24 52 ...
## $ Gender : num 1 2 1 1 2 1 2 1 2 2 ...
## $ Nationality : num 1 1 1 1 1 1 1 1 1 1 ...
## $ TripPurpose : num 2 1 1 2 2 4 4 1 1 1 ...
## $ TripDuration : num 7 4 10 7 4 40 40 7 4 5 ...
## $ FlyingCompanion : num 0 4 2 2 0 1 1 1 1 1 ...
## $ ProvinceResidence : num 3 3 3 3 3 3 3 3 3 5 ...
## $ GroupTravel : num 2 2 2 2 2 2 2 2 2 1 ...
## $ NoTripsLastYear : num 3 1 1 2 2 4 0 0 0 2 ...
## $ FrequentFlightDestination: chr "1" "1" "2" "5" ...
## $ Destination : num 3 2 1 2 1 1 1 2 3 3 ...
## $ FlightNo : chr "KE627" "KE703" "KE897" "KE725" ...
## $ DepartureHr : chr "12" "10" "11" "15" ...
## $ DepartureMn : num 5 10 30 20 0 45 45 15 35 0 ...
## $ DepartureTime : num 2 1 1 2 2 2 2 1 3 3 ...
## $ SeatClass : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Airfare : num 80 41 NA 40 40 34 34 56 34 NA ...
## $ NoTransport : num 1 1 2 1 1 1 1 1 2 2 ...
## $ ModeTransport : num 6 6 4 1 6 6 6 6 2 10 ...
## $ AccessCost : num 8000 8000 1000 NA 8000 8000 8000 8000 2400 90000 ...
## $ AccessTime : num 40 50 20 NA 50 50 50 50 5 60 ...
## $ Occupation : num 1 9 12 8 1 1 9 2 12 12 ...
## $ Income : num 5 3 NA 7 3 3 3 3 NA NA ...
## $ MileageAirline : chr "1" NA NA "2" ...
## $ Mileage : num 150000 NA NA 100000 NA NA NA NA NA NA ...
summary(dat1)
## ID Airport Airline Age
## Min. : 1.0 Min. :1.00 Min. :1.000 Min. :17.00
## 1st Qu.:122.8 1st Qu.:1.00 1st Qu.:1.000 1st Qu.:29.00
## Median :244.5 Median :2.00 Median :2.000 Median :38.00
## Mean :244.5 Mean :1.51 Mean :2.423 Mean :39.97
## 3rd Qu.:366.2 3rd Qu.:2.00 3rd Qu.:4.000 3rd Qu.:50.00
## Max. :488.0 Max. :2.00 Max. :4.000 Max. :80.00
## NA's :10 NA's :1
## Gender Nationality TripPurpose TripDuration
## Min. :1.000 Min. :1.000 Min. :1.000 Min. : 0.00
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.: 4.00
## Median :2.000 Median :1.000 Median :1.000 Median : 5.00
## Mean :1.528 Mean :1.484 Mean :1.518 Mean : 27.44
## 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.: 8.00
## Max. :2.000 Max. :5.000 Max. :4.000 Max. :730.00
## NA's :3
## FlyingCompanion ProvinceResidence GroupTravel NoTripsLastYear
## Min. : 0.00 Min. :1.000 Min. :1.000 Min. : 0.000
## 1st Qu.: 1.00 1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 1.000
## Median : 2.00 Median :3.000 Median :2.000 Median : 2.000
## Mean : 2.82 Mean :3.391 Mean :1.826 Mean : 3.262
## 3rd Qu.: 3.00 3rd Qu.:5.000 3rd Qu.:2.000 3rd Qu.: 3.000
## Max. :34.00 Max. :8.000 Max. :2.000 Max. :122.000
##
## FrequentFlightDestination Destination FlightNo
## Length:488 Min. :1.000 Length:488
## Class :character 1st Qu.:1.000 Class :character
## Mode :character Median :2.000 Mode :character
## Mean :2.176
## 3rd Qu.:3.000
## Max. :4.000
## NA's :5
## DepartureHr DepartureMn DepartureTime SeatClass
## Length:488 Min. : 0.00 Min. :1.000 Min. :1.000
## Class :character 1st Qu.:15.00 1st Qu.:2.000 1st Qu.:1.000
## Mode :character Median :30.00 Median :2.000 Median :1.000
## Mean :25.98 Mean :2.432 Mean :1.122
## 3rd Qu.:40.00 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :55.00 Max. :4.000 Max. :3.000
## NA's :120 NA's :4
## Airfare NoTransport ModeTransport AccessCost
## Min. : 3.00 Min. :1.000 Min. : 1.000 Min. : 0
## 1st Qu.: 35.00 1st Qu.:1.000 1st Qu.: 2.000 1st Qu.: 2000
## Median : 45.00 Median :1.000 Median : 3.000 Median : 6000
## Mean : 50.46 Mean :1.334 Mean : 3.801 Mean : 11220
## 3rd Qu.: 60.00 3rd Qu.:2.000 3rd Qu.: 6.000 3rd Qu.: 12000
## Max. :260.00 Max. :4.000 Max. :11.000 Max. :350000
## NA's :155 NA's :197
## AccessTime Occupation Income MileageAirline
## Min. : 4.00 Min. : 1.000 Min. :1.000 Length:488
## 1st Qu.: 25.00 1st Qu.: 2.000 1st Qu.:2.000 Class :character
## Median : 40.00 Median : 8.000 Median :3.000 Mode :character
## Mean : 51.83 Mean : 6.768 Mean :3.126
## 3rd Qu.: 60.00 3rd Qu.:12.000 3rd Qu.:4.000
## Max. :390.00 Max. :12.000 Max. :7.000
## NA's :97 NA's :132
## Mileage
## Min. : 1
## 1st Qu.: 8258
## Median : 27998
## Mean : 56384
## 3rd Qu.: 61808
## Max. :500000
## NA's :398
We have so many missing values in survey data:
Changing the survey data to make it easier to read and use during EDA process
Modifying the Airport Name for purpose of analysis
# 1. Inchoen (ICN) Airport
# 2. Gimpo (GMP) Airport
# No missing values
dat1 <- dat1 %>%
mutate(Airport = case_when(Airport == 1 ~ "ICN",
Airport == 2 ~ "GMP"))
Modifying Airline Choice Names
# 1. Korean Air(KE)
# 2. Asiana Airlines (OZ)
# 3. Korean LCC
# 4. Foreign Airlines
# Replace missing values
dat1 <- dat1 %>%
mutate(Airline = ifelse(!is.na(Airline), Airline, ifelse(str_detect(FlightNo, "KE"), "Korean Air", "Unknown"))) %>%
filter(Airline != "Unknown")
dat1 <- dat1 %>%
mutate(Airline = case_when(Airline == 1 ~ "Korean Air",
Airline == 2 ~ "Asiana Airlines",
Airline == 3 ~ "Korean LCC",
Airline == 4 ~ "Foreign Airlines"))
Modifying Gender
dat1 <- dat1 %>%
mutate(Gender = if_else(Gender == 1, "Male","Female","Not Mentioned"))
Modifying Nationality
# 1. Korea
# 2. China
# 3. Japan
# 4. Southeast Asia
# 5. Other
dat1 <- dat1 %>%
mutate(Nationality = case_when(Nationality == 1 ~ "Korea",
Nationality == 2 ~ "China",
Nationality == 3 ~ "Japan",
Nationality == 4 ~ "Southeast Asia",
Nationality == 5 ~ "Other"))
Modifying Purpose of Trip
# 1. Leisure Travel (sightseeing, visiting family, vacation, shopping)
# 2. Business Travel
# 3. Study
# 4. Other
dat1 <- dat1 %>%
mutate(TripPurpose = case_when(TripPurpose == 1 ~ "Leisure",
TripPurpose == 2 ~ "Business",
TripPurpose == 3 ~ "Study",
TripPurpose == 4 ~ "Other"))
Modifying Province Residence
provincial_distances <- provincial_distances %>%
separate(Province, c("ProvinceResidence", "Province"), sep = "\\.", remove=FALSE)
# Join the two datasets on airport name and province ids
dat1 <- dat1 %>%
mutate(ProvinceResidence = as.character(ProvinceResidence)) %>%
left_join(provincial_distances, by = c("Airport", "ProvinceResidence")) %>%
select(-c(ProvinceResidence)) %>%
mutate(Province = as.factor(Province),
Airport = as.factor(Airport))
Modifying Group Travel
# 1. Yes
# 2. No
dat1 <- dat1 %>%
mutate(GroupTravel = case_when(GroupTravel == 1 ~ 1, GroupTravel == 2 ~ 0))
Modifying Frequent Travel Destinations
# 1. Southeast Asia
# 2. China
# 3. Japan
# 4. North/South America
# 5. Europe
# 6. Other
# 7. None
dat1 <- dat1 %>%
mutate(FrequentFlightDestination = case_when(FrequentFlightDestination == 1 ~ "Southeast Asia",
FrequentFlightDestination == 2 ~ "China",
FrequentFlightDestination == 3 ~ "Japan",
FrequentFlightDestination == 4 ~ "North/South America",
FrequentFlightDestination == 5 ~ "Europe",
FrequentFlightDestination == 6 ~ "Other",
FrequentFlightDestination == 7 ~ "None"))
Modifying Flight Destinations
# 1. China
# 2. Japan
# 3. Southeast Asia
# 4. Other
dat1 <- dat1 %>%
mutate(Destination = case_when(Destination == 1 ~ "China",
Destination == 2 ~ "Japan",
Destination == 3 ~ "Southeast Asia",
Destination == 4 ~ "Other"))
Modifying Departure Time
# 1. 6am - 12pm
# 2. 12pm - 6pm
# 3. 6pm - 9pm
# 4. 9pm - 6am
dat1 <- dat1 %>%
mutate(DepartureTime = case_when(DepartureTime == 1 ~ "6am - 12pm",
DepartureTime == 2 ~ "12pm - 6pm",
DepartureTime == 3 ~ "6pm - 9pm",
DepartureTime == 4 ~ "9pm - 6am"))
Modifying Seat Class
# 1. Economy
# 2. Business
# 3. First Class
dat1 <- dat1 %>%
mutate(SeatClass = case_when(SeatClass == 1 ~ "Economy",
SeatClass == 2 ~ "Business",
SeatClass == 3 ~ "First Class"))
Modifying Mode of Transport
# 1. Car
# 2. Taxi
# 3. Bus
# 4. Express Bus
# 5. Subway
# 6. Airport Limousine
# 7. Rail
# 8. KTX (Express Rail)
# 9. Shuttle provided by travel agency
# 10. Air
# 11. Other
dat1 <- dat1 %>%
mutate(ModeTransport = case_when(ModeTransport == 1 ~ "Private Car",
ModeTransport == 2 ~ "Private Taxi",
ModeTransport == 3 ~ "Public Bus",
ModeTransport == 4 ~ "Public Bus",
ModeTransport == 5 ~ "Subway",
ModeTransport == 6 ~ "Private Taxi",
ModeTransport == 7 ~ "Public Train",
ModeTransport == 8 ~ "Public Train",
ModeTransport == 9 ~ "Private Taxi",
ModeTransport == 11 ~ "Other")) %>%
filter(ModeTransport != 10)
# Regroup into maximum 3-4 buckets
Modifying Age and add them into groups
#Youth (15-24 years)
#Adults (25-64 years)
#Seniors (65 years and over)
# dat1 <- dat1 %>%
# mutate(AgeGroup = case_when(Age >=15 & Age <= 24 ~ "15-24",
# Age >=25 & Age <= 34 ~ "25-34",
# Age >=35 & Age <= 44 ~ "35-44",
# Age >=45 & Age <= 54 ~ "45-54",
# Age >=55 & Age <= 64 ~ "55-64",
# TRUE ~ "Senior"))
Modifying Occupation
# 1. Entrepreneur, Senior management
# 2. Business (Corporate worker)
# 3. Sales, Service
# 4. Government, Military
# 5. Professionals (doctor, lawyer, professor)
# 6. industrial, manufacturing
# 7. self-employeed
# 8. Student
# 9. Housewife
# 10. Retired
# 11. None
# 12. Other
Modifying Income Class
# 1. 30 Million Won or less
# 2. 30~50 Million Won
# 3. 50~80 Million Won
# 4. 80 ~100 Million Won
# 5. 100 ~150 Million Won
# 6. 150 ~200 Million Won
# 7. 200 Million Won or more
Modifying Mileage Airline
# 1. Korea Airlines
# 2. Asian Airlines
# 3. Others
# 4. None
# 5. Unknown Missing Value
dat1 <- dat1 %>%
mutate(MileageAirline = case_when(MileageAirline == 1 ~ "Korea Airlines",
MileageAirline == 2 ~ "Asian Airlines",
MileageAirline == 3 ~ "Others",
MileageAirline == 4 ~ "None",
MileageAirline == 5 ~ "Unknown Missing Value"))
#dat1 <- dat1 %>%
# mutate(MileageAirline = ifelse(is.na(Mileage), 'No Mileage', 'Mileage'))
Replacing missing values in age
# Check distribution of age
hist(dat1$Age)
age_mean = mean(dat1$Age, na.rm = TRUE)
dat1$Age[is.na(dat1$Age)] <- age_mean
Not relevant
Replacing missing values in Airfare
Airport_Airline_data -> raw data which can be used to treat missing values# Check distribution of Airfare
hist(dat1$Airfare)
# change column name for join
names(airline_airfare)[names(airline_airfare) == 'Flight.Destination'] <- 'Destination'
dat1 = dat1 %>%
left_join(airline_airfare, by = c("Airport", "Airline", "Destination")) %>%
mutate(Airfare = ifelse(is.na(dat1$Airfare), `Average Price`/10000, Airfare))
# fill remaining from survey data average of group of -- Airport, Airlines, Destination, seat type
airfare_average = dat1 %>%
group_by(Airport, Airline, Destination, SeatClass) %>%
summarise(a_fare=mean(Airfare, na.rm = TRUE))
dat1 = dat1 %>%
left_join(airfare_average, by = c("Airport", "Airline", "Destination", "SeatClass")) %>%
mutate(Airfare = ifelse(is.na(dat1$Airfare), a_fare, Airfare))
Replacing missing values in Access Cost
# Check distribution of AccessCost
hist(dat1$AccessCost)
Airport_Airline_data -> raw data so these can not be treatedReplacing missing values in Access time
Airport_Airline_data -> raw data which can be used to treat missing values# Check distribution of AccessTime
hist(dat1$AccessTime)
Replace missing values in Access time
dat1 = dat1 %>%
mutate(AccessTime = ifelse(is.na(dat1$AccessTime),
`Travel Time, minutes`,
AccessTime))
Replacing missing values in Mileage
Airport_Airline_data -> raw data which can be used to treat missing values# Check distribution of Mileage
hist(dat1$Mileage)
Airport_Airline_data -> raw data so these missings can not be treated# There are only 25 rows without NA, so instead of removing NAs replace it with 999
dat1[is.na(dat1)] <- 999
# Remove ID column and Flight No column as they do not provide any benefit in this analysis because of their cardinal
dat1 <- dat1 %>%
select(-c(FlightNo, ID))
# Remove Departure Hr and Min as information is already captured in Departure Time
dat1 <- dat1 %>%
select(-c(DepartureHr, DepartureMn))
# Remove Occupation - Not relevant for this study
dat1 <- dat1 %>%
select(-c(Occupation))
# Remove Mileage - 398 missing values
# Remove MileageAirline - Missing values or unknown values
dat1 <- dat1 %>%
select(-c(Mileage, MileageAirline))
# Remove AccessCost - Lot of missing values
# Remove Income - missing values
dat1 <- dat1 %>%
select(-c(Income))
# Not Relevant for our analysis
dat1 <- dat1 %>%
select(-c(FlyingCompanion, NoTripsLastYear))
# Modify data
airport_survey_data <- dat1 %>%
mutate(Airport = as.factor(Airport),
Airline = as.factor(Airline),
Nationality = as.factor(Nationality),
TripPurpose = as.factor(TripPurpose),
Province = as.factor(Province),
Destination = as.factor(Destination),
DepartureTime = as.factor(DepartureTime),
SeatClass = as.factor(SeatClass),
NoTransport = as.factor(NoTransport),
ModeTransport = as.factor(ModeTransport),
Gender = as.factor(Gender))
summary(airport_survey_data)
## Airport Airline Age Gender
## GMP:245 Asiana Airlines :106 Min. :17.00 Female :249
## ICN:229 Foreign Airlines:137 1st Qu.:29.00 Male :222
## Korean Air :150 Median :37.00 Not Mentioned: 3
## Korean LCC : 81 Mean :39.83
## 3rd Qu.:50.00
## Max. :80.00
##
## Nationality TripPurpose TripDuration GroupTravel
## China : 32 Business:102 Min. : 0.00 Min. :0.0000
## Japan : 40 Leisure :317 1st Qu.: 4.00 1st Qu.:0.0000
## Korea :368 Other : 29 Median : 5.00 Median :0.0000
## Other : 14 Study : 26 Mean : 24.59 Mean :0.1624
## Southeast Asia: 20 3rd Qu.: 8.00 3rd Qu.:0.0000
## Max. :730.00 Max. :1.0000
##
## FrequentFlightDestination Destination DepartureTime
## Length:474 999 : 2 12pm - 6pm:205
## Class :character China :131 6am - 12pm: 46
## Mode :character Japan :155 6pm - 9pm :194
## Other : 22 9pm - 6am : 29
## Southeast Asia:164
##
##
## SeatClass Airfare NoTransport ModeTransport
## 999 : 4 Min. : 3.00 1:339 Other : 2
## Business : 34 1st Qu.: 36.22 2:120 Private Car :106
## Economy :424 Median : 46.83 3: 14 Private Taxi:176
## First Class: 12 Mean : 61.20 4: 1 Public Bus : 50
## 3rd Qu.: 58.00 Public Train: 39
## Max. :999.00 Subway :101
##
## AccessCost AccessTime Distance Province
## Min. : 0 Min. : 4.00 Min. : 23.72 Seoul :182
## 1st Qu.: 999 1st Qu.: 30.00 1st Qu.: 46.00 Kyungki-do :126
## Median : 1300 Median : 50.00 Median : 61.00 Jeju : 89
## Mean : 6785 Mean : 62.36 Mean :154.93 Kyungsang-do : 27
## 3rd Qu.: 8000 3rd Qu.: 64.00 3rd Qu.:363.00 Incheon : 22
## Max. :350000 Max. :390.00 Max. :455.00 Chungcheong-do: 13
## (Other) : 15
## Travel Time, minutes Average Price a_fare
## Min. : 39 Min. : 999 Min. : 14.00
## 1st Qu.: 56 1st Qu.: 999 1st Qu.: 42.17
## Median : 64 Median : 999 Median : 46.83
## Mean :103 Mean :166373 Mean : 61.20
## 3rd Qu.:180 3rd Qu.:362160 3rd Qu.: 52.72
## Max. :311 Max. :529260 Max. :999.00
##
attach(airport_survey_data)
# Selecting relevant columns for further developement of models in Python for analysis
# airport_choice <- airport_survey_data %>%
# select(Airport, Airline,AccessTime, DepartureTime,
# Age,Destination,Distance,AccessCost)
#
# airline_choice <- airport_survey_data %>%
# select(Airline,Airport,Airfare,DepartureTime,Destination)
# write.csv(airport_choice, "../code/airport_choice.csv")
# write.csv(airline_choice, "../code/airline_choice.csv")
write.csv(airport_survey_data, "../code/airport_survey_data.csv")
# Function to get the relevant grouped data frames
# df -> dataframe
# groupVariable -> the group by column
# output -> the resulting data frame
getGroupedUnivariateData <- function(df, groupVariable) {
# group_by() grouped variable
# summarise() reduces variable to descriptive stat
# count is the new variable name, n() is a counting function
# percent is a new variable, sum() and nrow() are functions
# tot_selling_price -> total selling price per grouped variable
# order_by() tot_selling_price
result <- df %>%
group_by_(groupVariable) %>%
summarise(count = n(),
percent = (sum(count) / nrow(df)) * 100,
avg_airfare = (mean(Airfare)),
avg_accessCost = (mean(AccessCost)),
avg_accessTime = (mean(AccessTime))) %>%
arrange(desc(percent)) %>%
as.data.frame()
return(result)
}
# count, average access cost and average access time by airport
airport_data_airport_type <- getGroupedUnivariateData(airport_survey_data, "Airport")
airport_data_airport_type %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(1, background = "#e5f5e0")
| Airport | count | percent | avg_airfare | avg_accessCost | avg_accessTime |
|---|---|---|---|---|---|
| GMP | 245 | 51.68776 | 53.56583 | 6207.988 | 53.94694 |
| ICN | 229 | 48.31224 | 69.37374 | 7401.502 | 71.35808 |
Observations
# count, average access cost and average access time by airline
airport_data_airline_type <- getGroupedUnivariateData(airport_survey_data, "Airline")
airport_data_airline_type %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(1, background = "#e5f5e0")
| Airline | count | percent | avg_airfare | avg_accessCost | avg_accessTime |
|---|---|---|---|---|---|
| Korean Air | 150 | 31.64557 | 56.99893 | 8698.613 | 62.55333 |
| Foreign Airlines | 137 | 28.90295 | 75.49061 | 5087.482 | 56.16788 |
| Asiana Airlines | 106 | 22.36287 | 54.11425 | 7168.349 | 69.98113 |
| Korean LCC | 81 | 17.08861 | 54.09939 | 5608.383 | 62.49383 |
Observations
# count, average access cost and average access time by gender
airport_data_gender <- getGroupedUnivariateData(airport_survey_data, "Gender")
airport_data_gender %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(1, background = "#e5f5e0")
| Gender | count | percent | avg_airfare | avg_accessCost | avg_accessTime |
|---|---|---|---|---|---|
| Female | 249 | 52.5316456 | 59.46850 | 5906.663 | 63.71084 |
| Male | 222 | 46.8354430 | 63.40278 | 7790.279 | 60.62162 |
| Not Mentioned | 3 | 0.6329114 | 42.38000 | 5233.333 | 78.66667 |
Observations
# count, average access cost and average access time by airport
airport_data_nationality <- getGroupedUnivariateData(airport_survey_data, "Nationality")
airport_data_nationality %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(1, background = "#e5f5e0")
| Nationality | count | percent | avg_airfare | avg_accessCost | avg_accessTime |
|---|---|---|---|---|---|
| Korea | 368 | 77.637131 | 64.72743 | 6943.886 | 59.80707 |
| Japan | 40 | 8.438819 | 53.48309 | 5474.450 | 57.00000 |
| China | 32 | 6.751055 | 48.24279 | 5540.250 | 66.06250 |
| Southeast Asia | 20 | 4.219409 | 47.33056 | 9749.700 | 100.90000 |
| Other | 14 | 2.953587 | 40.05821 | 4949.357 | 81.21429 |
# count, average access cost and average access time by airport
airport_data_TripPurpose <- getGroupedUnivariateData(airport_survey_data, "TripPurpose")
airport_data_TripPurpose %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(1, background = "#e5f5e0")
| TripPurpose | count | percent | avg_airfare | avg_accessCost | avg_accessTime |
|---|---|---|---|---|---|
| Leisure | 317 | 66.877637 | 63.29966 | 5232.372 | 61.24290 |
| Business | 102 | 21.518987 | 62.16415 | 11337.343 | 65.51961 |
| Other | 29 | 6.118144 | 45.29150 | 9199.724 | 56.37931 |
| Study | 26 | 5.485232 | 49.61629 | 5155.308 | 70.23077 |
# count, average access cost and average access time by destination
airport_data_Residence <- getGroupedUnivariateData(airport_survey_data, "Province")
airport_data_Residence %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(1, background = "#e5f5e0")
| Province | count | percent | avg_airfare | avg_accessCost | avg_accessTime |
|---|---|---|---|---|---|
| Seoul | 182 | 38.396625 | 56.23822 | 4293.868 | 43.39011 |
| Kyungki-do | 126 | 26.582279 | 59.49765 | 7224.571 | 55.61905 |
| Jeju | 89 | 18.776371 | 81.42943 | 8015.191 | 74.59551 |
| Kyungsang-do | 27 | 5.696203 | 64.06941 | 21316.519 | 141.85185 |
| Incheon | 22 | 4.641350 | 45.64090 | 1919.773 | 33.59091 |
| Chungcheong-do | 13 | 2.742616 | 42.61244 | 7492.000 | 78.46154 |
| Jeonra-do | 10 | 2.109705 | 51.00000 | 3219.200 | 188.50000 |
| Kangwon-do | 5 | 1.054852 | 46.60000 | 12680.000 | 108.00000 |
# count, average access cost and average access time by destination
airport_data_destination <- getGroupedUnivariateData(airport_survey_data, "Destination")
airport_data_destination %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(1, background = "#e5f5e0")
| Destination | count | percent | avg_airfare | avg_accessCost | avg_accessTime |
|---|---|---|---|---|---|
| Southeast Asia | 164 | 34.5991561 | 73.16723 | 8440.372 | 67.41463 |
| Japan | 155 | 32.7004219 | 51.03942 | 6089.342 | 55.20000 |
| China | 131 | 27.6371308 | 46.39340 | 5782.359 | 64.89313 |
| Other | 22 | 4.6413502 | 134.18831 | 4970.136 | 60.22727 |
| 999 | 2 | 0.4219409 | 35.00000 | 10500.000 | 60.00000 |
Observations
# count, average access cost and average access time by province residence
airport_data_deptime <- getGroupedUnivariateData(airport_survey_data, "DepartureTime")
airport_data_deptime %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(1, background = "#e5f5e0")
| DepartureTime | count | percent | avg_airfare | avg_accessCost | avg_accessTime |
|---|---|---|---|---|---|
| 12pm - 6pm | 205 | 43.248945 | 52.91412 | 5753.839 | 54.15122 |
| 6pm - 9pm | 194 | 40.928270 | 64.80852 | 8519.015 | 72.23196 |
| 6am - 12pm | 46 | 9.704641 | 88.91833 | 5789.913 | 55.00000 |
| 9pm - 6am | 29 | 6.118144 | 51.71462 | 4046.172 | 66.00000 |
# count, average access cost and average access time by seat class
airport_data_seatClass <- getGroupedUnivariateData(airport_survey_data, "SeatClass")
airport_data_seatClass %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(1, background = "#e5f5e0")
| SeatClass | count | percent | avg_airfare | avg_accessCost | avg_accessTime |
|---|---|---|---|---|---|
| Economy | 424 | 89.4514768 | 50.62049 | 6758.300 | 61.88915 |
| Business | 34 | 7.1729958 | 86.23899 | 7811.294 | 68.64706 |
| First Class | 12 | 2.5316456 | 293.16667 | 6650.000 | 66.50000 |
| 999 | 4 | 0.8438819 | 274.25000 | 1249.500 | 46.25000 |
Observations
Observations
# count, average access cost and average access time by province residence
airport_data_modeTransport <- getGroupedUnivariateData(airport_survey_data, "ModeTransport")
airport_data_modeTransport %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(1, background = "#e5f5e0")
| ModeTransport | count | percent | avg_airfare | avg_accessCost | avg_accessTime |
|---|---|---|---|---|---|
| Private Taxi | 176 | 37.1308017 | 54.75351 | 10959.472 | 53.63068 |
| Private Car | 106 | 22.3628692 | 61.15355 | 2261.434 | 59.04717 |
| Subway | 101 | 21.3080169 | 62.00375 | 1521.762 | 51.53465 |
| Public Bus | 50 | 10.5485232 | 65.36405 | 2970.780 | 60.36000 |
| Public Train | 39 | 8.2278481 | 83.70869 | 17156.051 | 139.92308 |
| Other | 2 | 0.4219409 | 48.05000 | 37999.500 | 90.00000 |
Observations
# Code histograms using grid.arrange so can see all quant variables together
grid.arrange(
airport_data_airport_type %>%
ggplot(aes(x = reorder(Airport, -count), y = count)) +
xlab("Airport") +
theme_classic() +
geom_bar(stat = "identity", fill = "darkolivegreen3"),
airport_data_airline_type %>%
ggplot(aes(x = reorder(Airline, -count), y = count)) +
xlab("Airline") +
theme_classic() +
geom_bar(stat = "identity", fill = "darkolivegreen3"),
airport_data_gender %>%
ggplot(aes(x = reorder(Gender, -count), y = count)) +
xlab("Gender") +
theme_classic() +
geom_bar(stat = "identity", fill = "darkolivegreen3"),
airport_data_seatClass %>%
ggplot(aes(x = reorder(SeatClass, -count), y = count)) +
xlab("SeatClass") +
theme_classic() +
geom_bar(stat = "identity", fill = "darkolivegreen3"),
airport_data_Residence %>%
ggplot(aes(x = reorder(Province, -count), y = count)) +
xlab("Province") +
theme_classic() +
geom_bar(stat = "identity", fill = "darkolivegreen3"),
airport_data_destination %>%
ggplot(aes(x = reorder(Destination, -count), y = count)) +
xlab("Destination") +
theme_classic() +
geom_bar(stat = "identity", fill = "darkolivegreen3"),
ncol = 2
)
Observations
# Get descriptive stats
summary(airport_survey_data)
## Airport Airline Age Gender
## GMP:245 Asiana Airlines :106 Min. :17.00 Female :249
## ICN:229 Foreign Airlines:137 1st Qu.:29.00 Male :222
## Korean Air :150 Median :37.00 Not Mentioned: 3
## Korean LCC : 81 Mean :39.83
## 3rd Qu.:50.00
## Max. :80.00
##
## Nationality TripPurpose TripDuration GroupTravel
## China : 32 Business:102 Min. : 0.00 Min. :0.0000
## Japan : 40 Leisure :317 1st Qu.: 4.00 1st Qu.:0.0000
## Korea :368 Other : 29 Median : 5.00 Median :0.0000
## Other : 14 Study : 26 Mean : 24.59 Mean :0.1624
## Southeast Asia: 20 3rd Qu.: 8.00 3rd Qu.:0.0000
## Max. :730.00 Max. :1.0000
##
## FrequentFlightDestination Destination DepartureTime
## Length:474 999 : 2 12pm - 6pm:205
## Class :character China :131 6am - 12pm: 46
## Mode :character Japan :155 6pm - 9pm :194
## Other : 22 9pm - 6am : 29
## Southeast Asia:164
##
##
## SeatClass Airfare NoTransport ModeTransport
## 999 : 4 Min. : 3.00 1:339 Other : 2
## Business : 34 1st Qu.: 36.22 2:120 Private Car :106
## Economy :424 Median : 46.83 3: 14 Private Taxi:176
## First Class: 12 Mean : 61.20 4: 1 Public Bus : 50
## 3rd Qu.: 58.00 Public Train: 39
## Max. :999.00 Subway :101
##
## AccessCost AccessTime Distance Province
## Min. : 0 Min. : 4.00 Min. : 23.72 Seoul :182
## 1st Qu.: 999 1st Qu.: 30.00 1st Qu.: 46.00 Kyungki-do :126
## Median : 1300 Median : 50.00 Median : 61.00 Jeju : 89
## Mean : 6785 Mean : 62.36 Mean :154.93 Kyungsang-do : 27
## 3rd Qu.: 8000 3rd Qu.: 64.00 3rd Qu.:363.00 Incheon : 22
## Max. :350000 Max. :390.00 Max. :455.00 Chungcheong-do: 13
## (Other) : 15
## Travel Time, minutes Average Price a_fare
## Min. : 39 Min. : 999 Min. : 14.00
## 1st Qu.: 56 1st Qu.: 999 1st Qu.: 42.17
## Median : 64 Median : 999 Median : 46.83
## Mean :103 Mean :166373 Mean : 61.20
## 3rd Qu.:180 3rd Qu.:362160 3rd Qu.: 52.72
## Max. :311 Max. :529260 Max. :999.00
##
Observations
# Code histograms using ggplot() so we can see all quant variables
grid.arrange(
airport_survey_data %>%
ggplot(aes(Age)) +
geom_histogram(fill = "darkolivegreen3") +
xlab("Age (in years)")+
theme_classic(),
airport_survey_data %>%
ggplot(aes(Airfare)) +
geom_histogram(fill = "darkolivegreen3") +
xlab("Airfare (in 10,000 KRW)")+
theme_classic(),
airport_survey_data %>%
ggplot(aes(AccessCost/1000)) +
geom_histogram(fill = "darkolivegreen3") +
xlab("Transportation Cost to the Airport (in 1000 KRW)") +
theme_classic() +
xlim(0, 100) +
ylim(0, 100),
airport_survey_data %>%
ggplot(aes(AccessTime)) +
geom_histogram(fill = "darkolivegreen3") +
xlab("Transportation Time to the Airport (in minutes)")+
theme_classic()
)
# get box plot of all the quanitative variables
par(mfrow = c(2, 2))
boxplot(Age, main = "Age (in years)")
boxplot(Airfare, main = "Airfare (in 10,000 KRW)")
boxplot(AccessCost/1000, main = "Transportation Cost to Airport(1000KRW)")
boxplot(AccessTime, main = "Transportation Time to Airport(in min)")
# Frequency count
dat1 %>%
tabyl(Airport, Airline)
## Airport Asiana Airlines Foreign Airlines Korean Air Korean LCC
## GMP 52 93 76 24
## ICN 54 44 74 57
# Make cross-tabs between two categorical variables using janitor package
# cross-tab of bedrooms and floors
airport_survey_data %>%
tabyl(Airline, Airport) %>%
adorn_totals(where = c("row", "col")) %>%
adorn_percentages(denominator = "all") %>%
adorn_pct_formatting(digits = 2) %>%
adorn_title() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(1, background = "#e5f5e0")
| Airport | |||
|---|---|---|---|
| Airline | GMP | ICN | Total |
| Asiana Airlines | 10.97% | 11.39% | 22.36% |
| Foreign Airlines | 19.62% | 9.28% | 28.90% |
| Korean Air | 16.03% | 15.61% | 31.65% |
| Korean LCC | 5.06% | 12.03% | 17.09% |
| Total | 51.69% | 48.31% | 100.00% |
# Frequency count
dat1 %>%
tabyl(Airport, Gender)
## Airport Female Male Not Mentioned
## GMP 134 110 1
## ICN 115 112 2
# cross-tab of bedrooms and waterfront
airport_survey_data %>%
tabyl(Gender, Airport) %>%
adorn_totals(where = c("row", "col")) %>%
adorn_percentages(denominator = "all") %>%
adorn_pct_formatting(digits = 2)%>%
adorn_title() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(1, background = "#e5f5e0")
| Airport | |||
|---|---|---|---|
| Gender | GMP | ICN | Total |
| Female | 28.27% | 24.26% | 52.53% |
| Male | 23.21% | 23.63% | 46.84% |
| Not Mentioned | 0.21% | 0.42% | 0.63% |
| Total | 51.69% | 48.31% | 100.00% |
# Frequency count
dat1 %>%
tabyl(Airport, Nationality)
## Airport China Japan Korea Other Southeast Asia
## GMP 26 39 171 7 2
## ICN 6 1 197 7 18
# Frequency count
dat1 %>%
tabyl(Airport, Province)
## Airport Chungcheong-do Incheon Jeju Jeonra-do Kangwon-do Kyungki-do
## GMP 6 11 69 5 3 59
## ICN 7 11 20 5 2 67
## Kyungsang-do Seoul
## 5 87
## 22 95
# cross-tab of bedrooms and condition
airport_survey_data %>%
tabyl(Province, Airport) %>%
adorn_totals(where = c("row", "col")) %>%
adorn_percentages(denominator = "all") %>%
adorn_pct_formatting(digits = 2) %>%
adorn_title() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(1, background = "#e5f5e0")
| Airport | |||
|---|---|---|---|
| Province | GMP | ICN | Total |
| Chungcheong-do | 1.27% | 1.48% | 2.74% |
| Incheon | 2.32% | 2.32% | 4.64% |
| Jeju | 14.56% | 4.22% | 18.78% |
| Jeonra-do | 1.05% | 1.05% | 2.11% |
| Kangwon-do | 0.63% | 0.42% | 1.05% |
| Kyungki-do | 12.45% | 14.14% | 26.58% |
| Kyungsang-do | 1.05% | 4.64% | 5.70% |
| Seoul | 18.35% | 20.04% | 38.40% |
| Total | 51.69% | 48.31% | 100.00% |
# Frequency count
dat1 %>%
tabyl(Airport, Destination)
## Airport 999 China Japan Other Southeast Asia
## GMP 2 87 130 4 22
## ICN 0 44 25 18 142
# Make cross-tabs between two categorical variables using janitor package
# cross-tab of Airport and Destinationn
airport_survey_data %>%
tabyl(Destination, Airport) %>%
adorn_totals(where = c("row", "col")) %>%
adorn_percentages(denominator = "all") %>%
adorn_pct_formatting(digits = 2)%>%
adorn_title() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(1, background = "#e5f5e0")
| Airport | |||
|---|---|---|---|
| Destination | GMP | ICN | Total |
| 999 | 0.42% | 0.00% | 0.42% |
| China | 18.35% | 9.28% | 27.64% |
| Japan | 27.43% | 5.27% | 32.70% |
| Other | 0.84% | 3.80% | 4.64% |
| Southeast Asia | 4.64% | 29.96% | 34.60% |
| Total | 51.69% | 48.31% | 100.00% |
# cross-tab of freq flight destination and Airport
airport_survey_data %>%
tabyl(FrequentFlightDestination, Airport) %>%
adorn_totals(where = c("row", "col")) %>%
adorn_percentages(denominator = "all") %>%
adorn_pct_formatting(digits = 2) %>%
adorn_title() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(1, background = "#e5f5e0")
| Airport | |||
|---|---|---|---|
| FrequentFlightDestination | GMP | ICN | Total |
| 999 | 2.53% | 4.01% | 6.54% |
| China | 16.24% | 5.27% | 21.52% |
| Europe | 1.90% | 3.59% | 5.49% |
| Japan | 18.99% | 4.22% | 23.21% |
| None | 1.05% | 1.48% | 2.53% |
| North/South America | 1.27% | 1.27% | 2.53% |
| Other | 3.80% | 2.53% | 6.33% |
| Southeast Asia | 5.91% | 25.95% | 31.86% |
| Total | 51.69% | 48.31% | 100.00% |
# Frequency count
dat1 %>%
tabyl(Airport, DepartureTime)
## Airport 12pm - 6pm 6am - 12pm 6pm - 9pm 9pm - 6am
## GMP 135 35 74 1
## ICN 70 11 120 28
# Frequency count
dat1 %>%
tabyl(Airline, Gender)
## Airline Female Male Not Mentioned
## Asiana Airlines 52 53 1
## Foreign Airlines 71 66 0
## Korean Air 77 73 0
## Korean LCC 49 30 2
# cross-tab of Airline and Gender
airport_survey_data %>%
tabyl(Airline, Gender) %>%
adorn_totals(where = c("row", "col")) %>%
adorn_percentages(denominator = "all") %>%
adorn_pct_formatting(digits = 2) %>%
adorn_title() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(1, background = "#e5f5e0")
| Gender | ||||
|---|---|---|---|---|
| Airline | Female | Male | Not Mentioned | Total |
| Asiana Airlines | 10.97% | 11.18% | 0.21% | 22.36% |
| Foreign Airlines | 14.98% | 13.92% | 0.00% | 28.90% |
| Korean Air | 16.24% | 15.40% | 0.00% | 31.65% |
| Korean LCC | 10.34% | 6.33% | 0.42% | 17.09% |
| Total | 52.53% | 46.84% | 0.63% | 100.00% |
# Frequency count
dat1 %>%
tabyl(Airline, Nationality)
## Airline China Japan Korea Other Southeast Asia
## Asiana Airlines 5 8 84 4 5
## Foreign Airlines 18 20 90 4 5
## Korean Air 9 10 123 4 4
## Korean LCC 0 2 71 2 6
# cross-tab of Airline and Nationality
airport_survey_data %>%
tabyl(Airline, Nationality) %>%
adorn_totals(where = c("row", "col")) %>%
adorn_percentages(denominator = "all") %>%
adorn_pct_formatting(digits = 2) %>%
adorn_title() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(1, background = "#e5f5e0")
| Nationality | ||||||
|---|---|---|---|---|---|---|
| Airline | China | Japan | Korea | Other | Southeast Asia | Total |
| Asiana Airlines | 1.05% | 1.69% | 17.72% | 0.84% | 1.05% | 22.36% |
| Foreign Airlines | 3.80% | 4.22% | 18.99% | 0.84% | 1.05% | 28.90% |
| Korean Air | 1.90% | 2.11% | 25.95% | 0.84% | 0.84% | 31.65% |
| Korean LCC | 0.00% | 0.42% | 14.98% | 0.42% | 1.27% | 17.09% |
| Total | 6.75% | 8.44% | 77.64% | 2.95% | 4.22% | 100.00% |
# Frequency count
dat1 %>%
tabyl(Airline, Destination)
## Airline 999 China Japan Other Southeast Asia
## Asiana Airlines 0 32 47 3 24
## Foreign Airlines 0 46 53 10 28
## Korean Air 2 53 48 9 38
## Korean LCC 0 0 7 0 74
# cross-tab of Airline and Destination
airport_survey_data %>%
tabyl(Airline, Destination) %>%
adorn_totals(where = c("row", "col")) %>%
adorn_percentages(denominator = "all") %>%
adorn_pct_formatting(digits = 2) %>%
adorn_title() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(1, background = "#e5f5e0")
| Destination | ||||||
|---|---|---|---|---|---|---|
| Airline | 999 | China | Japan | Other | Southeast Asia | Total |
| Asiana Airlines | 0.00% | 6.75% | 9.92% | 0.63% | 5.06% | 22.36% |
| Foreign Airlines | 0.00% | 9.70% | 11.18% | 2.11% | 5.91% | 28.90% |
| Korean Air | 0.42% | 11.18% | 10.13% | 1.90% | 8.02% | 31.65% |
| Korean LCC | 0.00% | 0.00% | 1.48% | 0.00% | 15.61% | 17.09% |
| Total | 0.42% | 27.64% | 32.70% | 4.64% | 34.60% | 100.00% |
# Frequency count
dat1 %>%
tabyl(Airline, DepartureTime)
## Airline 12pm - 6pm 6am - 12pm 6pm - 9pm 9pm - 6am
## Asiana Airlines 39 4 55 8
## Foreign Airlines 60 8 61 8
## Korean Air 89 15 43 3
## Korean LCC 17 19 35 10
# cross-tab of Airline and DepartureTime
airport_survey_data %>%
tabyl(Airline, DepartureTime) %>%
adorn_totals(where = c("row", "col")) %>%
adorn_percentages(denominator = "all") %>%
adorn_pct_formatting(digits = 2) %>%
adorn_title() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(1, background = "#e5f5e0")
| DepartureTime | |||||
|---|---|---|---|---|---|
| Airline | 12pm - 6pm | 6am - 12pm | 6pm - 9pm | 9pm - 6am | Total |
| Asiana Airlines | 8.23% | 0.84% | 11.60% | 1.69% | 22.36% |
| Foreign Airlines | 12.66% | 1.69% | 12.87% | 1.69% | 28.90% |
| Korean Air | 18.78% | 3.16% | 9.07% | 0.63% | 31.65% |
| Korean LCC | 3.59% | 4.01% | 7.38% | 2.11% | 17.09% |
| Total | 43.25% | 9.70% | 40.93% | 6.12% | 100.00% |
# cross-tab of floors and waterfront
airport_survey_data %>%
tabyl(TripPurpose, Destination) %>%
adorn_totals(where = c("row", "col")) %>%
adorn_percentages(denominator = "all") %>%
adorn_pct_formatting(digits = 2) %>%
adorn_title() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(1, background = "#e5f5e0")
| Destination | ||||||
|---|---|---|---|---|---|---|
| TripPurpose | 999 | China | Japan | Other | Southeast Asia | Total |
| Business | 0.00% | 7.81% | 9.70% | 0.84% | 3.16% | 21.52% |
| Leisure | 0.21% | 14.77% | 18.78% | 3.38% | 29.75% | 66.88% |
| Other | 0.21% | 2.74% | 1.69% | 0.21% | 1.27% | 6.12% |
| Study | 0.00% | 2.32% | 2.53% | 0.21% | 0.42% | 5.49% |
| Total | 0.42% | 27.64% | 32.70% | 4.64% | 34.60% | 100.00% |
#Function for heatmap
heat_map <- function (x, y,y_title,x_title){
x<-enquo(x)
y<-enquo(y)
airport_survey_data %>%
group_by(!!x,!!y) %>%
summarise(count = n()) %>%
ggplot(aes(!!x,!!y)) +
geom_tile(aes(fill = count))+
ylab(y_title) + xlab(x_title)+
theme_classic() +
scale_fill_continuous(guide = guide_legend(title = "Count")) +
scale_fill_gradient(low = "white", high = "darkolivegreen3")
}
grid.arrange(
heat_map(Airport, Airline, "Airline", "Airport"),
heat_map(Airport, Destination, "Destination", "Airport"),
heat_map(Airport, Province, "Province", "Airport"),
heat_map(TripPurpose, Destination, "Destination", "TripPurpose"),
nrow=2)
heat_map(Airport, TripPurpose, "TripPurpose", "Airport" )
heat_map(Airline, TripPurpose, "TripPurpose", "Airline" )
heat_map(GroupTravel, Airline, "GroupTravel", "Airport" )
heat_map(GroupTravel, Airport, "GroupTravel", "Airline" )
heat_map(Gender, Airline, "Airline", "Gender" )
heat_map(Gender, Airport, "Airport", "Gender" )
ggplot(dat1, aes(x=Age, y=Airline)) +
geom_point(size=2, shape=23)
ggplot(dat1, aes(x=Age, y=Airport)) +
geom_point(size=2, shape=23)
ggplot(dat1, aes(x=Age, y=Gender, color=Airport)) +
geom_point(size=2, shape=23)
ggplot(dat1, aes(x=Age, y=Gender, color=Airline)) +
geom_point(size=2, shape=23)
ggplot(dat1, aes(x=Destination, y=Airfare, color=Airline)) +
geom_point(size=2, shape=23) + ylim(0, 270)
## Warning: Removed 6 rows containing missing values (geom_point).
The standard measure between quantitative variables is correlation
# Find correlation of quantitative variables
#cor_plot <- dat1 %>%
# select(-c(FrequentFlightDestination, -MileageAirline))
#cor(cor_plot)
# Correlation plot
dat1 %>%
select(Airport, Airline, Age, Gender, TripPurpose, TripDuration, Destination, DepartureTime, Province, Airfare, AccessCost, AccessTime) %>%
pairs.panels(hist.col="darkolivegreen3")
airport_airline_airfare_graph <- ggplot(airport_survey_data, aes(x = Airline, y = Airfare, color = Airport)) +
geom_point(alpha = 0.5) +
ylim(0, 300) +
geom_smooth(method=lm, se=FALSE) +
theme_classic()+
theme(legend.title = element_text(size=10),
plot.title = element_text(hjust = 0, face = 'bold',color = 'black'),
plot.subtitle = element_text(face = "italic")) +
labs(x = 'Airlines', y = 'Airfare (10,000 Korean Won, KRW)', title = "ICN airport has higher airfares") +
scale_fill_brewer(palette = "blue")
airport_airline_airfare_graph
airport_airline_bargraph <- airport_survey_data %>%
group_by(Airport, Airline) %>%
summarize (Avg_Fare = mean(Airfare))
ggplot(data = airport_airline_bargraph, aes(x = Airline, y = Avg_Fare, fill = Airport)) +
geom_bar(stat="identity", position="dodge") +
coord_flip() +
scale_fill_brewer() +
labs(x = 'Airlines', y = 'Airfare (10,000 Korean Won, KRW)', title = "ICN airport has higher average airfares") +
theme_classic()+
theme(legend.title = element_text(size=10),
plot.title = element_text(hjust = 0, face = 'bold',color = 'black'),
plot.subtitle = element_text(face = "italic"))
airport_gender_age_graph <- ggplot(airport_survey_data, aes(x = Gender, y = Age, color = Airport)) +
geom_point(alpha = 0.5) +
ylim(0, 85) +
geom_smooth(method=lm, se=FALSE, color="black") +
theme_classic()+
theme(legend.title = element_text(size=10),
plot.title = element_text(hjust = 0, face = 'bold',color = 'black'),
plot.subtitle = element_text(face = "italic")) +
labs(x = 'Gender', y = 'Age (Years)', title = "Older people prefer GMP airport") +
guides(color = guide_legend(title = ' Airports')) +
scale_fill_brewer()
airport_gender_age_graph
airport_gender_age_bargraph <- airport_survey_data %>%
group_by(Airport, Gender) %>%
summarize (Avg_age = mean(Age))
ggplot(data = airport_gender_age_bargraph, aes(x = Gender, y = Avg_age, fill = Airport)) +
geom_bar(stat="identity", position="dodge") +
coord_flip() +
scale_fill_brewer() +
labs(x = 'Gender', y = 'Age (Years)', title = "Among surveyed, average age of people going to GMP is higher") +
theme_classic()+
theme(legend.title = element_text(size=10),
plot.title = element_text(hjust = 0, face = 'bold',color = 'black'),
plot.subtitle = element_text(face = "italic"))
airport_destination_bargraph <- airport_survey_data %>%
group_by(Airport, Destination) %>%
summarize (Count = n())
ggplot(data = airport_destination_bargraph, aes(x = Destination, y = Count, fill = Airport)) +
geom_bar(stat="identity", position="dodge") +
coord_flip() +
scale_fill_brewer() +
labs(x = 'Destination', y = 'Count', title = "People prefer GMP airport while travelling to Japan and China", subtitle = "People prefer ICN airport while travelling to South East Asia and other airports") +
theme_classic()+
theme(legend.title = element_text(size=10),
plot.title = element_text(hjust = 0, face = 'bold',color = 'black'),
plot.subtitle = element_text(face = "italic"))
airport_province_bargraph <- airport_survey_data %>%
group_by(Airport, Province) %>%
summarize (Count = n())
ggplot(data = airport_province_bargraph, aes(x = Province, y = Count, fill = Airport)) +
geom_bar(stat="identity", position="dodge") +
coord_flip() +
scale_fill_brewer() +
labs(x = 'Province', y = 'Count', title = "More travelled via GMP from Seoul,Kyungsang-do,Kyungki-do,Chungcheong-do", subtitle = "More travelled via ICN from Kangwon-do and Jeju") +
theme_classic()+
theme(legend.title = element_text(size=10),
plot.title = element_text(hjust = 0.6, color = 'black'),
plot.subtitle = element_text(face = "italic"))
airport_transport_travelTime_graph <- airport_survey_data %>%
group_by(Airport, ModeTransport) %>%
summarize (Average_Travel_Time = mean(`Travel Time, minutes`))
ggplot(data = airport_transport_travelTime_graph, aes(x = ModeTransport, y = Average_Travel_Time, fill = Airport)) +
#geom_bar(stat="identity", position="dodge") +
geom_bar(stat = "identity", position = position_dodge2(preserve = "single", padding = 0)) +
coord_flip() +
scale_fill_brewer() +
labs(x = 'Mode of Transport', y = 'Average Travel Time (in minutes)', title = "Average Travel Time to ICN is more") +
theme_classic()+
theme(legend.title = element_text(size=10),
plot.title = element_text(hjust = 0.6, face = 'bold', color = 'black'),
plot.subtitle = element_text(face = "italic"))
ggplot(dat1, aes(fill=Airline, y=Airfare, x=Airline)) +
geom_bar(position="dodge", stat="summary", fun.y="mean")
chisq.test(table(airport_survey_data$Airport, airport_survey_data$Airline))
##
## Pearson's Chi-squared test
##
## data: table(airport_survey_data$Airport, airport_survey_data$Airline)
## X-squared = 30.529, df = 3, p-value = 1.068e-06
chisq.test(table(airport_survey_data$Airport, airport_survey_data$Age))
##
## Pearson's Chi-squared test
##
## data: table(airport_survey_data$Airport, airport_survey_data$Age)
## X-squared = 77.248, df = 57, p-value = 0.03835
chisq.test(table(airport_survey_data$Airport, airport_survey_data$Destination))
##
## Pearson's Chi-squared test
##
## data: table(airport_survey_data$Airport, airport_survey_data$Destination)
## X-squared = 183.63, df = 4, p-value < 2.2e-16
chisq.test(table(airport_survey_data$Airport, airport_survey_data$Airfare))
##
## Pearson's Chi-squared test
##
## data: table(airport_survey_data$Airport, airport_survey_data$Airfare)
## X-squared = 237.84, df = 90, p-value = 2.739e-15
t.test(airport_survey_data$Airfare[airport_survey_data$Airport == "ICN"], airport_survey_data$Airfare[airport_survey_data$Airport == "GMP"])
##
## Welch Two Sample t-test
##
## data: airport_survey_data$Airfare[airport_survey_data$Airport == "ICN"] and airport_survey_data$Airfare[airport_survey_data$Airport == "GMP"]
## t = 1.5608, df = 401.61, p-value = 0.1194
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.102883 35.718697
## sample estimates:
## mean of x mean of y
## 69.37374 53.56583
t.test(airport_survey_data$Age[airport_survey_data$Airport == "ICN"], airport_survey_data$Age[airport_survey_data$Airport == "GMP"])
##
## Welch Two Sample t-test
##
## data: airport_survey_data$Age[airport_survey_data$Airport == "ICN"] and airport_survey_data$Age[airport_survey_data$Airport == "GMP"]
## t = -3.1916, df = 467.59, p-value = 0.00151
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.338893 -1.507751
## sample estimates:
## mean of x mean of y
## 37.79913 41.72245
t.test(airport_survey_data$`Travel Time, minutes`[airport_survey_data$Airport == "ICN"], airport_survey_data$`Travel Time, minutes`[airport_survey_data$Airport == "GMP"])
##
## Welch Two Sample t-test
##
## data: airport_survey_data$`Travel Time, minutes`[airport_survey_data$Airport == and airport_survey_data$`Travel Time, minutes`[airport_survey_data$Airport == "ICN"] and "GMP"]
## t = 0.82831, df = 454.09, p-value = 0.4079
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.255885 20.285900
## sample estimates:
## mean of x mean of y
## 106.1048 100.0898
# Compute the analysis of variance
res.aov <- aov(Airfare ~ Airport + Airline + Gender + Destination, data = airport_survey_data)
# Summary of the analysis
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Airport 1 29578 29578 2.553 0.1107
## Airline 3 59951 19984 1.725 0.1610
## Gender 2 1159 580 0.050 0.9512
## Destination 4 170079 42520 3.670 0.0059 **
## Residuals 463 5363580 11584
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(dat1)
## Airport Airline Age Gender
## Length:474 Length:474 Min. :17.00 Length:474
## Class :character Class :character 1st Qu.:29.00 Class :character
## Mode :character Mode :character Median :37.00 Mode :character
## Mean :39.83
## 3rd Qu.:50.00
## Max. :80.00
##
## Nationality TripPurpose TripDuration GroupTravel
## Length:474 Length:474 Min. : 0.00 Min. :0.0000
## Class :character Class :character 1st Qu.: 4.00 1st Qu.:0.0000
## Mode :character Mode :character Median : 5.00 Median :0.0000
## Mean : 24.59 Mean :0.1624
## 3rd Qu.: 8.00 3rd Qu.:0.0000
## Max. :730.00 Max. :1.0000
##
## FrequentFlightDestination Destination DepartureTime
## Length:474 Length:474 Length:474
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## SeatClass Airfare NoTransport ModeTransport
## Length:474 Min. : 3.00 Min. :1.000 Length:474
## Class :character 1st Qu.: 36.22 1st Qu.:1.000 Class :character
## Mode :character Median : 46.83 Median :1.000 Mode :character
## Mean : 61.20 Mean :1.319
## 3rd Qu.: 58.00 3rd Qu.:2.000
## Max. :999.00 Max. :4.000
##
## AccessCost AccessTime Distance Province
## Min. : 0 Min. : 4.00 Min. : 23.72 Seoul :182
## 1st Qu.: 999 1st Qu.: 30.00 1st Qu.: 46.00 Kyungki-do :126
## Median : 1300 Median : 50.00 Median : 61.00 Jeju : 89
## Mean : 6785 Mean : 62.36 Mean :154.93 Kyungsang-do : 27
## 3rd Qu.: 8000 3rd Qu.: 64.00 3rd Qu.:363.00 Incheon : 22
## Max. :350000 Max. :390.00 Max. :455.00 Chungcheong-do: 13
## (Other) : 15
## Travel Time, minutes Average Price a_fare
## Min. : 39 Min. : 999 Min. : 14.00
## 1st Qu.: 56 1st Qu.: 999 1st Qu.: 42.17
## Median : 64 Median : 999 Median : 46.83
## Mean :103 Mean :166373 Mean : 61.20
## 3rd Qu.:180 3rd Qu.:362160 3rd Qu.: 52.72
## Max. :311 Max. :529260 Max. :999.00
##
summary(airport_survey_data)
## Airport Airline Age Gender
## GMP:245 Asiana Airlines :106 Min. :17.00 Female :249
## ICN:229 Foreign Airlines:137 1st Qu.:29.00 Male :222
## Korean Air :150 Median :37.00 Not Mentioned: 3
## Korean LCC : 81 Mean :39.83
## 3rd Qu.:50.00
## Max. :80.00
##
## Nationality TripPurpose TripDuration GroupTravel
## China : 32 Business:102 Min. : 0.00 Min. :0.0000
## Japan : 40 Leisure :317 1st Qu.: 4.00 1st Qu.:0.0000
## Korea :368 Other : 29 Median : 5.00 Median :0.0000
## Other : 14 Study : 26 Mean : 24.59 Mean :0.1624
## Southeast Asia: 20 3rd Qu.: 8.00 3rd Qu.:0.0000
## Max. :730.00 Max. :1.0000
##
## FrequentFlightDestination Destination DepartureTime
## Length:474 999 : 2 12pm - 6pm:205
## Class :character China :131 6am - 12pm: 46
## Mode :character Japan :155 6pm - 9pm :194
## Other : 22 9pm - 6am : 29
## Southeast Asia:164
##
##
## SeatClass Airfare NoTransport ModeTransport
## 999 : 4 Min. : 3.00 1:339 Other : 2
## Business : 34 1st Qu.: 36.22 2:120 Private Car :106
## Economy :424 Median : 46.83 3: 14 Private Taxi:176
## First Class: 12 Mean : 61.20 4: 1 Public Bus : 50
## 3rd Qu.: 58.00 Public Train: 39
## Max. :999.00 Subway :101
##
## AccessCost AccessTime Distance Province
## Min. : 0 Min. : 4.00 Min. : 23.72 Seoul :182
## 1st Qu.: 999 1st Qu.: 30.00 1st Qu.: 46.00 Kyungki-do :126
## Median : 1300 Median : 50.00 Median : 61.00 Jeju : 89
## Mean : 6785 Mean : 62.36 Mean :154.93 Kyungsang-do : 27
## 3rd Qu.: 8000 3rd Qu.: 64.00 3rd Qu.:363.00 Incheon : 22
## Max. :350000 Max. :390.00 Max. :455.00 Chungcheong-do: 13
## (Other) : 15
## Travel Time, minutes Average Price a_fare
## Min. : 39 Min. : 999 Min. : 14.00
## 1st Qu.: 56 1st Qu.: 999 1st Qu.: 42.17
## Median : 64 Median : 999 Median : 46.83
## Mean :103 Mean :166373 Mean : 61.20
## 3rd Qu.:180 3rd Qu.:362160 3rd Qu.: 52.72
## Max. :311 Max. :529260 Max. :999.00
##
write.xlsx(dat1, 'export_from_code.xlsx')
## Note: zip::zip() is deprecated, please use zip::zipr() instead
# Write CSV in R
write.csv(dat1, file = "export_from_code.csv",row.names=FALSE)
Skim summary statistics
n obs: 466
n variables: 8
| variable | missing | complete | n | n_unique | top_counts | ordered |
|---|---|---|---|---|---|---|
| Airline | 0 | 466 | 466 | 4 | Kor: 148, For: 133, Asi: 105, Kor: 80 | FALSE |
| Airport | 0 | 466 | 466 | 2 | GMP: 241, ICN: 225, NA: 0 | FALSE |
| DepartureTime | 0 | 466 | 466 | 4 | 12p: 203, 6pm: 191, 6am: 43, 9pm: 29 | FALSE |
| Destination | 0 | 466 | 466 | 4 | Sou: 160, Jap: 154, Chi: 131, Oth: 21 | FALSE |
| Nationality | 0 | 466 | 466 | 5 | Kor: 361, Jap: 39, Chi: 32, Sou: 20 | FALSE |
| Province | 0 | 466 | 466 | 8 | Se: 181, Ky: 122, Je: 86, Ky: 27 | FALSE |
| variable | missing | complete | n | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 466 | 466 | 39.71 | 13.59 | 17 | 29 | 37 | 50 | 80 | ▃▇▆▅▃▃▂▁ |
| Airfare | 0 | 466 | 466 | 49.24 | 24.52 | 3 | 36.22 | 46.83 | 56 | 260 | ▃▇▁▁▁▁▁▁ |